Course overview

Author
Affiliation

Dave Clark

Binghamton University

Published

January 19, 2025

About this class

All models are wrong \(\ldots\)

“All models are wrong, some are useful.” Box & Draper (1987)

This course is based on the principle that to build useful models, we need:

  • an intuitive understanding of regression models
  • data science skills
  • careful and creative thinking about politics
  • a general skepticism of individual models, but an optimism about the modeling enterprise

The regression model

This course focuses on the linear regression model, but with a heavy emphasis on data science skills like data management/wrangling, and coding.

Course Goals

In May, you should be able to:

  • develop and test hypotheses about politics.

  • estimate, evaluate, and interpret basic linear and nonlinear regression models.

  • evaluate, understand, clean, wrangle, join, and reshape data.

  • write R code, to manage data, implement simulation methods, estimate models, visualize data/predictions.

  • understand the data generation process as a motivation for modeling.

This class

Is not a math class,

but one that uses basic math toward an applied goal.

This class

  • is aimed at application; the more you struggle with code, the more you’ll learn.

  • the more you use other people’s code, the less you’ll learn.

  • the more you look at other people’s code, the more you’ll learn. You should live on stackoverflow etc.

  • is applied - if you apply the tools we cover, you’ll learn a lot; if not, you won’t.

  • the students who do best later on in this program challenge themselves here.

This class

  • is a collaborative effort among you, and with me and Oguzhan. We learn from each other when we collaborate.

  • This means having honest conversations in class about what we do and do not understand.

  • It means working together to learn R, to wrangle data, and to understand models; working together on your assignments.

  • the most successful cohorts are those that work collaboratively; this does not mean free-riding, but struggling with everything together.

What do you need to know for 501?

What do you need at the start?

Matrix algebra

You should be able to make sense of this in terms of dimensions and operations:

\[\beta = (X'X)^{-1} X'y\] where \(X\) is a data matrix of rows and columns; these are the variables in a regression. \(y\) is the outcome variable, the same number of rows as \(X\). Define the dimensions of \(X\), \(y\), and \(\beta\), of \((X'X)^{-1}\) and \(X'y\).

Probability theory

  • what are probability distributions?
  • what is the PDF of a variable?
  • what is the CDF of a variable?

Understanding Regression models

The class focuses on regression models of the general form (scalar notation):

\[ y_i = \beta_0 + X_1\beta_1 + X_2\beta_2 \ldots + X_k\beta_k + \epsilon\]

or in matrix notation,

\[ \mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon} \]

An outcome, \(y\) is a function of some combination of \(X\) variables, their coefficients (\(\beta\), including an intercept), and an error term.

Regression models

Posit a statistical relationship between:

  • an outcome variable, \(y\).
  • a covariate of interest, \(x\).
  • a set of controls, \(\mathbf{X}\).

The effect of \(x\) is denoted \(\beta\). It is the partial effect of x on y, because removed from that effect are the effects of x -> X -> y. This is what we mean by controlling for the effects of X. Much more on this later.

Regression models

Can be estimated using different technologies including:

  • Least Squares
  • Maximum Likelihood
  • Bayesian methods

This course is about the technology of least squares; we will deal with one ML regression model (logit) later in the semester.

Illustration

Models to understand politics

\(~\)

This a methods class, but the methods are only meaningful to us insofar as they help us understand politics. So let’s motivate the class with a question about politics.

Careful thinking

Correlation does not imply causation.


Repression during the COVID-19 Pandemic

Among the products of the pandemic is an opportunity for governments, under the guise of protecting public health, to repress citizens.

There is some evidence of democratic backsliding prior to the pandemic, but the global public health crisis gave governments a specific and universal opportunity to take advantage of this softened ground and to become more authoritarian.

We’re going to examine the covariates of violence against civilians during the COVID period, focusing on how states of emergency create changes in government violence.


Data

Let’s look at ACLED event data, and data on states of emergency during the COVID period from Lundgren et al (2020).

The next figure plots protests and repression since mid-2019, the onset of the pandemic marked at March 15, 2020.

The blue lines are local regressions showing trends; while protests return to pre-pandemic levels (similar to the years prior), repression increases since the onset of COVID-19.


Protests

code
acled<-read.csv("/Users/dave/Documents/2013/PITF/analysis/protests2021/data/2022analysis/acled2022.csv")

##Event_date variable has a "Chr" format. 
#Need to convert it into date format to calculate 7 day averages for Protests and Violence against Civilians

acled$event_date<-as.Date(acled$event_date,format="%d %B %Y")
#class(acled$event_date)

##Compute number of events per day

acled$perday<-1

protests<-filter(acled,event_type=="Protests" )
stateviol<-filter(acled, event_type=="Violence against civilians")
protests<-aggregate(perday ~ event_date, data = protests, sum)
stateviol<-aggregate(perday ~ event_date, data = stateviol, sum)


##Calculating a 7 day moving average
protests$rollmean<-rollmean(protests$perday,k=7, fill=NA)
stateviol$rollmean<-rollmean(stateviol$perday,k=7, fill=NA)

protests <- protests %>% filter(protests$event_date < "2022-06-25")
stateviol <- stateviol %>% filter(stateviol$event_date < "2022-06-25")

# protestplot <- ggplot(data=protests, aes(x=event_date, y=rollmean)) +
#   geom_line()+
#   geom_smooth() +
#   geom_vline(xintercept=as.numeric(as.Date("2020-03-15")), color="red") +
#   scale_x_date(date_breaks = "6 months", date_labels = '%b %Y')  +
#   labs(x="Date", y="Protests", caption="7 day rolling averages; smoothed loess trend lines; red lines indicate 3.15.2020 COVID onset; 'ACLED data retrieved 7.18.22, http://acleddata.com") + 
#   ggtitle("Protests During the Pandemic") 
#    
# 
# protestplot


library(highcharter)
library(dplyr)

# First handle potential NA values and ensure data is properly ordered
protests <- protests %>%
  arrange(event_date) %>%
  # Optionally remove NA values if present
  filter(!is.na(rollmean), !is.na(event_date))

# Calculate the smoothed trend line with explicit handling of the data points
x_vals <- as.numeric(protests$event_date)
smooth_fit <- loess(rollmean ~ x_vals, data = protests, span = 0.75)
protests$smooth_trend <- predict(smooth_fit, x_vals)

bucolors<-list("#005A43","#6CC24A", "#A7DA92", "#BDBEBD", "#000000" )

# Create the highchart
hchart <- highchart() %>%
  hc_add_series(
    data = protests,
    type = "line",
    hcaes(x = event_date, y = rollmean),
    color= "#000000",
    name = "Rolling Average"
  ) %>%
  hc_add_series(
    data = protests,
    type = "line",
    hcaes(x = event_date, y = smooth_trend),
    name = "Trend",
    dashStyle = "Solid",
    color = "#005A43",
    opacity = 0.7
  ) %>%
  hc_xAxis(
    plotLines = list(list(
      value = as.numeric(as.POSIXct("2020-03-15")) * 1000,
      color = "red",
      width = 2,
      zIndex = 3
    )),
    type = "datetime",
    dateTimeLabelFormats = list(month = "%b %Y"),
    tickInterval = 6 * 30 * 24 * 3600 * 1000
  ) %>%
  hc_yAxis(
    title = list(text = "Protests")
  ) %>%
  hc_title(
    text = "Protests During the Pandemic"
  ) %>%
  hc_caption(
    text = "7 day rolling averages; smoothed loess trend lines; red line indicates 3.15.2020 COVID onset; ACLED data retrieved 7.18.22",
    align = "left",
    style = list(fontStyle = "italic")
  ) %>%
  hc_tooltip(
    shared = TRUE,
    crosshairs = TRUE
  ) %>%
  hc_credits(enabled = FALSE)

# Display the chart
hchart

Repression

code
# repressionplot <- ggplot(data=stateviol, aes(x=event_date, y=rollmean)) +
#   geom_line()+
#   geom_smooth() +
#   geom_vline(xintercept=as.numeric(as.Date("2020-03-15")), color="red") +
#   scale_x_date(date_breaks = "6 months", date_labels = '%b %Y')  +
#   labs(x="Date", y="Violence against civilians", caption="7 day rolling averages; smoothed loess trend lines; red lines indicate 3.15.2020 COVID onset; ACLED data retrieved 7.18.22, http://acleddata.com") + 
#   ggtitle("Violence Against Civilians During the Pandemic") 
#   
# repressionplot

# First handle potential NA values and ensure data is properly ordered
stateviol <- stateviol %>%
  arrange(event_date) %>%
  # Optionally remove NA values if present
  filter(!is.na(rollmean), !is.na(event_date))

# Calculate the smoothed trend line with explicit handling of the data points
x_vals <- as.numeric(stateviol$event_date)
smooth_fit <- loess(rollmean ~ x_vals, data = stateviol, span = 0.75)
stateviol$smooth_trend <- predict(smooth_fit, x_vals)

bucolors<-list("#005A43","#6CC24A", "#A7DA92", "#BDBEBD", "#000000" )

# Create the highchart
hchart <- highchart() %>%
  hc_add_series(
    data = stateviol,
    type = "line",
    hcaes(x = event_date, y = rollmean),
    color= "#000000",
    name = "Rolling Average"
  ) %>%
  hc_add_series(
    data = stateviol,
    type = "line",
    hcaes(x = event_date, y = smooth_trend),
    name = "Trend",
    dashStyle = "Solid",
    color = "#005A43",
    opacity = 0.7
  ) %>%
  hc_xAxis(
    plotLines = list(list(
      value = as.numeric(as.POSIXct("2020-03-15")) * 1000,
      color = "red",
      width = 2,
      zIndex = 3
    )),
    type = "datetime",
    dateTimeLabelFormats = list(month = "%b %Y"),
    tickInterval = 6 * 30 * 24 * 3600 * 1000
  ) %>%
  hc_yAxis(
    title = list(text = "Violence Against Civilians")
  ) %>%
  hc_title(
    text = "Violence Against Civilians During the Pandemic"
  ) %>%
  hc_caption(
    text = "7 day rolling averages; smoothed loess trend lines; red line indicates 3.15.2020 COVID onset; ACLED data retrieved 7.18.22",
    align = "left",
    style = list(fontStyle = "italic")
  ) %>%
  hc_tooltip(
    shared = TRUE,
    crosshairs = TRUE
  ) %>%
  hc_credits(enabled = FALSE)

# Display the chart
hchart

States of Emergency during the Pandemic

code
####################
#analysis data
####################

#make analysis data frame
analysisdata <- left_join(perday, soe, by=c("ccode"="ccode", "event_date"="event_date"))
analysisdata <- left_join(analysisdata, covid, by=c("ccode"="ccode", "event_date"="date"))

#zeros for NA in soe_s
analysisdata <- analysisdata %>% mutate(across(soe_s, ~replace_na(., 0)))

#fill soe series after onset
analysisdata <- analysisdata %>% group_by(ccode) %>% mutate(cumSOE = cumsum(soe_s)) %>% ungroup

#plot cumulative states of emergency over time
totalsoe <- analysisdata %>% group_by(event_date) %>%
  summarise(across(cumSOE, sum)) %>%
  ungroup() 

totalsoe$event_date <- as.Date(totalsoe$event_date)

totalsoe <- totalsoe %>% filter(event_date >"2020-01-01"& event_date <"2020-07-01")

ggplot() +
  geom_line(data=totalsoe, aes(x=event_date, y=cumSOE))+
  scale_x_date(date_breaks = "1 months", date_labels = '%b %Y') +
  labs ( colour = NULL, x = "Date", y =  "Active States of Emergency" )  


Modeling repression

Let’s build a simple model predicting violence against civilians. The data here are daily, covering 167 countries between January 2020 and July 2022. We’ll predict the 7 day moving average of repression depending on whether a state of emergency is in place.

\(~\)

My expectation is states of emergency will make repression easier to motivate and therefore, more frequent.


The model

\(y\) = violence against civilians

\(x\) = states of emergency

Controlling for:

  • new daily deaths from COVID (logged)
  • percentage of the population over 65 years old
  • the Human Development Index
  • the 7 day average number of protests.

Modeling Repression and SoEs

code
##################
#models
#################

analysisdata$soeprotests <- analysisdata$cumSOE*analysisdata$protest7day
analysisdata$lnd =  log(analysisdata$new_deaths+.01)
#label variables for coef plot
library("labelled")
analysisdata <- analysisdata %>%
  set_variable_labels(
    cumSOE = "State of Emergency",
    lnd = "ln(New COVID deaths)",
    aged_65_older = "% age 65 +",
    human_development_index = "Human Development Index",
    protest7day = "Protests, 7 day avg",
    soeprotests = "SoE * Protests, 7 day avg", 
    `Violence against civilians` = "Violence against Civilians"
  )

#model predicting repression during covid 

m1 <- lm(data=analysisdata, `Violence against civilians` ~ cumSOE  + lnd + aged_65_older+human_development_index + protest7day )
#summary(m1)  

library(sjPlot)
tab_model(m1, show.se=TRUE,  p.style="stars", show.ci=FALSE)
  Violence against
Civilians
Predictors Estimates std. Error
(Intercept) 0.93 *** 0.03
State of Emergency 0.26 *** 0.01
ln(New COVID deaths) 0.02 *** 0.00
%age 65+ -0.02 *** 0.00
Human Development Index -0.68 *** 0.04
Protests,7 day avg 0.04 *** 0.00
Observations 165211
R2 / R2 adjusted 0.057 / 0.057
* p<0.05   ** p<0.01   *** p<0.001

Another look

code
#coefficient plot
ggcoef_model(
  m1,show_p_values = FALSE,
  signif_stars = FALSE, exponentiate =FALSE, intercept=TRUE,
  colour=NULL, include = c("cumSOE", "lnd", "aged_65_older", "human_development_index", "protest7day")
)+
  xlab("Average Marginal Effects") +
  ggtitle("Predictors of Violence Against Civilians") 


Discussion

The estimates indicate:

  • if all the \(X\) variables are set to zero, the expected violence against civilians is 0.93 (the intercept). Can we reasonably set all the \(X\) variables to zero?

  • the difference between a state with a SoE and without is different from zero, but quite small; 0.26 uses of violence.

  • more deaths from COVID are associated with more violence against civilians; \(ln(\beta) = .02\). Exponentiate that, and the effect on violence is 1.02, so large compared to anything else in the model.


Quantities of interest (simulation)

Let’s generate predictions from the model by simulating the distribution of \(\widehat{\beta}\):

  • sample 1000 times from a multivariate normal distribution with mean \(\widehat{\beta}\)s and variances of \(var(\widehat{\beta})\).

  • plug those simulated \(\widehat{\beta}\)s into the equation using mean/median/mode values for the \(X\)s.


Simulating quantities

First, set \(SoE = 0\); vary \(covid~deaths = ln(1 \ldots 28000)\), \(\%~age~ 65 = 6\) (the median), \(HDI = .74\) (the median), and \(protests = 2\) (the median).

\[ \widehat{y} = \beta_0 + \beta_1* 0 +\beta_2*log(p) + \beta_3*6 +\beta_4*.74+ \beta_5*2 \]

then, repeat but with \(SoE = 1\):

\[ \widehat{y} = \beta_0 + \beta_1*1 +\beta_2*log(p) + \beta_3*6 +\beta_4*.74+ \beta_5*2 \]


Predicted Violence against Civilians

code
#simulated quantities
sigma <- vcov(m1)
B <- data.frame(rmvnorm(n=1000, mean=coef(m1), vcov(m1)))

colnames(B) <-c('b0', 'b1', 'b2', 'b3', 'b4', 'b5')

predictions <- data.frame ( lb0 = numeric(0),med0= numeric(0),
      ub0= numeric(0),lb1= numeric(0),med1= numeric(0),ub1= 
      numeric(0), deaths = numeric(0))

for (p in seq(0,28000,100)) {
  
  xbRA0  <- quantile(B$b0 + B$b1*0 + B$b2*log(p) + B$b3*6 +B$b4*.74+ 
                       B$b5*2, probs=c(.025,.5,.975))
  xbRA1 <- quantile(B$b0 + B$b1*1 + B$b2*log(p) + B$b3*6 +B$b4*.74+ 
                      B$b5*2, probs=c(.025,.5,.975))
  xbRA0<- data.frame(t(xbRA0))
  xbRA1<- data.frame(t(xbRA1))
  predictions[p:p,] <- data.frame(xbRA0, xbRA1, p)
}
predictions$deaths <- log(predictions$deaths)

#plot 
ggplot()+
  geom_ribbon(data=predictions, aes(x=deaths, ymin=lb0, ymax=ub0),fill = "grey70", alpha = .4, ) +
  geom_ribbon(data=predictions, aes(x=deaths, ymin=lb1, ymax=ub1), fill= "grey60",  alpha = .4, ) +
  geom_line(data= predictions, aes(x=deaths, y=med0))+
  geom_line(data= predictions, aes(x=deaths, y=med1))+
  labs ( colour = NULL, x = "ln(New COVID Deaths)", y =  "Expected Repression Episodes" ) +
  annotate("text", x = 5.5, y = .82, label = "State of Emergency", size=3.5, colour="gray30")+
  annotate("text", x = 6.5, y = .55, label = "No State of Emergency", size=3.5, colour="gray30")


Discussion

SoE states have higher levels of violence against civilians, but the difference is very small, making me think we haven’t found anything particularly interesting here.

  • Covid deaths is associated with more violence; though violence nearly doubles over the range of deaths (x-axis), the range is huge - from 1 to 28,000. If we transform the x-axis to deaths rather than \(ln(deaths)\), we can see this effect is pretty isolated.

Simulated effects (again)

code
predictions$deaths <- exp(predictions$deaths)

#plot 
ggplot()+
  geom_ribbon(data=predictions, aes(x=deaths, ymin=lb0, ymax=ub0),fill = "grey70", alpha = .4, ) +
  geom_ribbon(data=predictions, aes(x=deaths, ymin=lb1, ymax=ub1), fill= "grey60",  alpha = .4, ) +
  geom_line(data= predictions, aes(x=deaths, y=med0))+
  geom_line(data= predictions, aes(x=deaths, y=med1))+
  labs ( colour = NULL, x = "New COVID Deaths", y =  "Expected Repression Episodes" ) +
  annotate("text", x = 5000, y = 1, label = "State of Emergency", size=3.5, colour="gray30")+
  annotate("text", x = 10000, y = .65, label = "No State of Emergency", size=3.5, colour="gray30")


Discussion

The increase in violence occurs almost entirely between zero and 5000 COVID deaths. We can speculate on why this might be; doing so might lead us to wonder how useful this model is.

\(~\) \(~\)

Much of our focus this semester will be on evaluating how useful models are, and figuring out ways to build useful models.

Back to top